Comparing generative and extractive approaches to information extraction from abstracts describing randomized clinical trials

Background Systematic reviews of Randomized Controlled Trials (RCTs) are an important part of the evidence-based medicine paradigm. However, the creation of such systematic reviews by clinical experts is costly as well as time-consuming, and results can get quickly outdated after publication. Most RCTs are structured based on the Patient, Intervention, Comparison, Outcomes (PICO) framework and there exist many approaches which aim to extract PICO elements automatically. The automatic extraction of PICO information from RCTs has the potential to significantly speed up the creation process of systematic reviews and this way also benefit the field of evidence-based medicine. Results Previous work has addressed the extraction of PICO elements as the task of identifying relevant text spans or sentences, but without populating a structured representation of a trial. In contrast, in this work, we treat PICO elements as structured templates with slots to do justice to the complex nature of the information they represent. We present two different approaches to extract this structured information from the abstracts of RCTs. The first approach is an extractive approach based on our previous work that is extended to capture full document representations as well as by a clustering step to infer the number of instances of each template type. The second approach is a generative approach based on a seq2seq model that encodes the abstract describing the RCT and uses a decoder to infer a structured representation of a trial including its arms, treatments, endpoints and outcomes. Both approaches are evaluated with different base models on a manually annotated dataset consisting of RCT abstracts on an existing dataset comprising 211 annotated clinical trial abstracts for Type 2 Diabetes and Glaucoma. For both diseases, the extractive approach (with flan-t5-base) reached the best \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1$$\end{document}F1 score, i.e. 0.547 (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pm 0.006$$\end{document}±0.006) for type 2 diabetes and 0.636 (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pm 0.006$$\end{document}±0.006) for glaucoma. Generally, the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$F_1$$\end{document}F1 scores were higher for glaucoma than for type 2 diabetes and the standard deviation was higher for the generative approach. Conclusion In our experiments, both approaches show promising performance extracting structured PICO information from RCTs, especially considering that most related work focuses on the far easier task of predicting less structured objects. In our experimental results, the extractive approach performs best in both cases, although the lead is greater for glaucoma than for type 2 diabetes. For future work, it remains to be investigated how the base model size affects the performance of both approaches in comparison. Although the extractive approach currently leaves more room for direct improvements, the generative approach might benefit from larger models.


Introduction
The number of publications describing Randomized Controlled Trials has been increasing at an exponential pace for decades [1], thus making it more and more challenging to appropriately summarize the existing clinical evidence by way of systematic reviews.Yet, the ability to summarize the current clinical evidence is a core process to support evidence-based medical decision making [2].Indeed, the creation of systematic reviews is costly and time consuming as it is done manually by clinical experts with the result that systematic reviews and guidelines quickly become outdated after publication or are even outdated at the time of publication [3][4][5][6].Due to the effort associated with the creation of systematic reviews, there has been significant interest on the question how to automate their creation [7][8][9].Recently, approaches to automatically summarize clinical evidence by way of argumentative structures have been proposed [10].The bottleneck for such approaches is the missing availability of a database of semantically described clinical trials that comprise of structured representations of the key outcomes of each study.As argued by Sánchez-Graillet et al. [10], information extraction approaches have the potential to support the extraction of key information about the design and results of clinical trials from publications.These structured representations of the results of a trial in turn could support the process of systematic review creation or at least considerably reduce the effort to do so.
Most RCTs follow the PICO (Patient, Intervention, Comparison, Outcomes) framework for structuring the presentation of clinical research findings.As a result, early IE approaches in the clinical domain classify full sentences of RCTs [11,12] or smaller text spans [13] into the elements of the PICO framework.However, treating the PICO elements as flat objects represented as a collection of text spans does not reflect the complex information presented in RCTs for the following reasons: 1) the description of a single PICO element could be spread across several sentences and 2) the relationship between different PICO elements is not modelled (e.g. which outcomes belong to the intervention group and which ones belong to the comparison group).
Witte and Cimiano [14] have proposed an extractive information extraction approach that captures the design and key results of an RCT by way of 10 different templates that capture the PICO elements in a structured way, modelling dependencies and relations between them.These templates are based on the C-TrO Ontology that has been designed to support use cases related to the aggregation of evidence from multiple clinical trials [15].Those templates are instantiated with information from a given abstract describing the trial.For instance, a template Medication with slots DrugName, DoseValue and DoseUnit could be used to describe medications of intervention arms mentioned in a RCT.However, Witte and Cimiano [14] assume that the number of template instances (e.g.number of outcomes) is provided a-priori, which hinders the application of their approach in real world settings.Further, the approach of Witte and Cimiano [14] chunks the text into smaller segments and then combines the templates instantiated for each segment.This makes it difficult to capture relations that are mentioned across chunks.
In this paper, we build on the approach of Witte and Cimiano [14] and extend it in two directions.First, we rely on Longformers [16] and Flan-T5 [17] in order to encode the complete abstract, inferring template instances and slots jointly for the complete text.Second, overcoming the key assumption that the number of template instances are known a priori, we extend the approach by a clustering step that induces the number of template instances in an unsupervised manner.
Beyond the extractive approach, we also present a generative approach that is inspired in recent seq2seq architectures such as REBEL [18] or GenIE [19].These approaches rely on an encoder-decoder architecture by which the text is encoded and certain output structures are generated.We apply this idea to directly decode a complex nested template structure representing the design and key results of a study.As main novelties, we propose a decoding approach that relies on a grammar to guide decoding, ensuring that only valid structures are generated.Second, we present an approach to linearize the structure to be predicted such that it can be encoded as a sequence to be predicted by the generative approach.Our grammar-constrained decoding approach is inspired by Lu et al. [20], who also prune/ mask the vocabulary to consist only of elements which comply with the desired output format.The decoding mechanism presented in this work generalizes the output format specification to arbitrary right-linear context-free grammars.
We evaluate and compare both approaches on the dataset provided by Sanchez-Graillet et al. [21] and used in previous work [14], which consists of predicting 10 templates.The dataset comprises a total of 211 documents for two diseases: type 2 diabetes (104) and glaucoma (107).Our results show that the improved extractive approach using Flan-T5 as a base model performs best for both diseases in the dataset, achieving a mean F 1 score of 0.547 ( ±0.006 ) for type 2 dia- betes and 0.636 ( ±0.006 ) for glaucoma.However, both approaches have different strengths and weaknesses and are not yet suitable to fully automate the process of systematic review creation, but still have the potential to reduce the necessary effort a lot.
Additional data and evaluations (Appendix 2, 4 and 5) as well as the used grammar (Appendix 1) and a case study (Appendix 3) can be found in the appendix.
In summary, our contributions are the following: • We present an extension of the approach proposed by Witte and Cimiano [14] in two directions: i) relying on Longformers [16] and Flan-T5 [17] to encode the complete abstract and infer templates and slots for the complete document jointly, and ii) using a clustering step to cluster the extracted template instances to infer the number of instances for each template type.• We present a novel generative information extraction approach that relies on a grammar to guide decoding, and propose a novel serialization of the nested template structure such that the problem can be casted as a seq2seq inference problem.• We evaluate both approaches on the dataset by Sanchez-Graillet et al. [21] and show that our improved extractive approach using Flan-T5 [17] as a base model performs best for both diseases.
Some related work also deals with detecting clinical trial outcomes, outcome spans (e.g., Abaho et al. [37][38][39], Ganguly et al. [40]) or slot fillers (e.g., Papanikolaou et al. [41]) in (randomized) clinical trial abstracts.However, they lack the specific structure and dependencies of PICO templates and slots, which are used in this paper.These approaches mostly use transformer architectures, sometimes in combination with, e.g., LSTMs to detect the outcomes/slot fillers.
The PICO framework is frequently used to describe the results of RCTs in a structured way.This structure comprises of a number of templates and corresponding slots (which are uniquely assigned to a single template type).However, a RCT can contain multiple instances of a template, imposing the problem of matching recognized slot fillers with their corresponding template instance.Some efforts in this area focus on the problem that larger amounts of training data are missing or at least expensive to create due to the need for clinical experts as annotators.These approaches therefore utilize distant or weak supervision for training on noisy label data (e.g., Dhrangadhariya and Müller [42], Nye et al. [43], Wallace et al. [44], Liu et al. [45]).In contrast, the approach presented in this paper relies on the availability of sufficient classical supervised training data.
Several recent approaches use transformer models like BERT (Bidirectional Encoder Representations from Transformers, Devlin et al. [53]) for PICO recognition, but focus on different architectual and task-related details.
However, some approaches refer to PICO elements as flat classes, i.e. parts of sentences are just labeled, e.g., P or I, whereas our approach considers PICO elements to be nested structures, i.e. templates with slots that have to be filled with some portion of text.Examples for this simplified view on PICO elements are listed in the following: Schmidt et al. [54] treat the PICO recognition task as a sentence classification/question answering task and thus, in contrast to the approach presented in this paper, do not work on the level of whole documents/abstracts or PICO elements which span multiple sentences.Therefore, Schmidt et al. [54] do not benefit from contextualized representations utilizing the whole abstract as a context.Moreover, the problem of mapping found PICO elements to unique template instances is not dealt with.
Zhang et al. [55] propose a multi-step approach that first identifies P, I/C and O elements in the text using either Convolutional Neural Networks (CNNs) or Bi-LSTMs.After that, a Diseases Named Entity Recognition model is used to extract disease-related entities in the PICO-labeled sentences.Various different models, like, e.g., BERT-based or LSTM-based models, are compared in this category.Finally, a mapping model resolves some ambiguities, like intersections of recognition results for P and O. Again, different models (including both BERT and Bi-LSTMs) are evaluated for this task.Although this approach makes some efforts to create more structured results than flat sentence classification, it still ignores some aspects of the more complex structure of PICO elements.
Whitton and Hunter [56] propose a more structured view on PICO elements, e.g., by differentiating between two arms of a RCT.This is achieved in two steps by first applying a named-entity recognition model, recognizing three general types of entities (interventions, outcomes and measures).In a second step, they are then related to each other using a relation extraction model which also differentiates between the (up to) two arms of the considered RCTs.However, they focus on evidence tables, which are different from the nested template structure we work with in this paper.Moreover, the other approach does not work in a sequence-to-sequence manner with constrained decoding like the generative approach described in this paper.
Dhrangadhariya et al. [57] implement PICO recognition for more fine-grained entities, which -similarly to our approach -also consider more detailed information about participants, interventions and outcomes, like sample size, age, mortality, drugs or surgical interventions.Nevertheless, it is still less detailed than the template structure used in this paper, which consists of 10 templates comprising overall 85 slots (see Witte and Cimiano [14]).Moreover, by using BERT as an encoder and Bi-LSTM, self-attention as well as CRF and linear layers for classification, it does not work in a sequenceto-sequence manner like the generative approach we present in this work.

Methods
In this work, we address the problem of extracting a set of template instances from unstructured text.We tackle this problem from two different perspectives and present two approaches solving the same problem: 1) an extractive approach and 2) a generative approach.An illustration of both approaches can be found in Fig. 1.
The used data model captures the design and key results of an RCT by way of 10 different templates consisting of a total of 85 different slots that capture various aspects of the PICO elements in a structured way.These templates are based on the C-TrO Ontology that has been designed to support use cases related to the aggregation of evidence from multiple clinical trials [15].The mean number of slot fillers per template is shown in Table 1.A template t i is defined by a type i ∈ L and a set of slots S i = j s ij , where s ij denotes slot j of template t i , j this way denotes the set union over all slots j and L denotes the set of all template types.A template is instantiated by assigning slot-fillers to its slots, where a slot-filler can be either a text span from the input document or a template instance, depending on the slot.Figure 2 visualizes the used data model.In the following subsections, we describe the extractive and the generative approach in more detail.

Extractive approach
Our extractive approach is based on the Intra-Template Compatibility (ITC) approach [14], which adopts a twostep architecture: In a first step, all textual slot-fillers are extracted from the input document, followed by a second step, which assigns the extracted slot-fillers to template instances.The extraction of slot-fillers and their clustering and assignment are described in the "Extraction of textual slot-fillers" and "Assignment of textual slot-fillers to template instances" Sections, respectively.

Encoding of the input document
The ITC approach uses BERT (Bidirectional Encoder Representations from Transformers) [53] to compute a contextualized representation of each token w i of the input document d = (w 1 , . . ., w n ) .As the length of RCT abstracts typically exceeds the maximum number of tokens of most BERT implementations, the authors of ITC split the document into consecutive chunks and process each chunk separately.However, this approach treats each chunk as an isolated unit and hence the model is not able to learn token representations which incorporate the context of the full input document.Therefore, we adopt the Longformer [16] approach as well as the Flan-T5 model [17] to learn full-document contextualized representations h i ∈ R d (with d = 768 for both T5 and Longformer models) for each token w i of the input docu- ment, where d is the output dimension of the encoder of the respective model.

Extraction of textual slot-fillers
The ITC approach extracts slot-fillers from the input document by predicting start and end tokens of slot-fillers, followed by a step which joins the predicted start and end tokens.This is realized by training two linear layers which take the contextualized representation h i of the tokens w i as input and predicts whether or not this is a slot-filler start or end token, respectively: where S = i S i ∪ {O} is the set of all slots including the special no-slot label O which indicates that a token is not classified as a start/end token of a slot-filler.The vectors p s,i , p e,i denote the predicted probability distribution over the slots that a token w i is the start/end of the respective slots.The final prediction is determined by the arg max operation.
The predicted start/end tokens are joined sentencewise by minimizing the distance between start and end tokens in terms of tokens in between.More precisely, for a given sentence, we first collect all predicted start and end tokens.For each predicted start token w s , at position i we seek an end token w e at position j ≥ i with matching label and minimal distance to w s and assign it to w s as its end token.Finally, we discard predicted start/end tokens which have no matching end/start token.This slightly differs from the IOB format [58], as only start and end token of a sequence are tagged and all tokens in between (1)  are classified just like tokens which are not part of any sequence.A comparison of both tagging schemes can be found in Table 2.
For each extracted slot-filler i with start/end tokens w s resp.w e with corresponding token representations h s , resp.h e , ITC computes a representation e i by summing the representations of the start and end tokens followed by a dense layer with ReLU [59] activation function: The learned representations e i of the extracted slot-fill- ers (SFs) are then used as input to subsequent modules.In the remainder of this paper, we denote the set of all extracted slot fillers as E , where each slot filler in E is rep- resented by its vector representation computed by Eq. ( 3).

Assignment of textual slot-fillers to template instances
Typically, for some slot types like the textual slot fillers of the Outcome template, there are several slot fillers of the same type extracted from an original document.Therefore, (3) we need a way to group these slot fillers such that actual template instances, e.g.multiple Outcome instances, can be created from these slot fillers.Deciding which slot fillers belong together is however not a trivial task.
The assignment of extracted SFs to template instances is therefore done in ITC by a clustering approach per template based on a pairwise similarity or compatibility function q : R d × R d → [0, 1] .q scores the similarity between two SFs in the sense that they belong to the same template instance, where g(e i , e j ) = 1 indicates maxi- mum similarity such that e i and e j should be assigned to the same template instance.Note that e i and e j are entity representations calculated based on the contextualized embeddings generated by the used models.Thus, we can use results from the established field of (density-based) clustering to figure out the SF grouping.The similarity function q is implemented in a slightly more complex way compared to the original paper, using two linear layers with a ReLU activation function in between and followed by a sigmoid activation function: Note that due to the symmetry of + , also q is a sym- metric function, i.e. q(e i , e j ) = q(e j , e i ) for all pairs of e i , e j .Then the mean pairwise similarity between SFs of a cluster C i ⊆ E is given by The score of a clustering is the mean score of its cluster scores: The ITC approach seeks a clustering C * i of m i clusters which maximizes the score given by Eq. ( 7): where U i,m i denotes the set of all clusterings of the set E i with m i clusters.Note that the optimization objec- tive defined by Eq. ( 8) is parameterized by the number of clusters m i .In order to alleviate the assumption that the number of instances of templates needs to be known a priori, we propose a clustering step to induce the number of template instances per template type using Hierarchical Agglomerative Clustering (HAC) with a threshold based on the average of values computed for the training data, namely:

• the average similarity values of pairs belonging to the same template instance • the average similarity values of pairs belonging to different instances
After the clustering C * i (m i ) has been estimated, the template instances t ij are derived from those clusters The slot to which a SF e k ∈ C * j is assigned is given by the label assigned by the SF extraction module by Eqs. ( 1) and (2).In summary, the assignment of SFs to template instances is done as follows: 1.For each template t i , the set E i ⊆ E of SFs which can be assigned to instances of template type t i is estimated.2. Equation (8) or Agglomerative Hierarchical Clustering is used to find some clustering of the SFs in E i . (4) q(e i , e j ) = σ (w T s (q ′ (e i , e j )) 3. The template instances are derived from the clusters in the clustering.
Given these similarities and a clustering threshold of, e.g., 0.5, this results in two clusters which can be then directly used to create the corresponding Outcome template instances.These two clusters are: 1. PercentageAffected: 16 and TimePoint: week 24 2. PercentageAffected: 8 and TimePoint: week 12 The clustering thus provides a robust and flexible way to both determine the number of template instances to generate as well as the groups of slot fillers those instances comprise.

Generative approach
In this section we propose a simple generative approach for extracting template instances from unstructured text based on the Transformer [60] encoder-decoder model.As encoder-decoder models require the output to be a linear token sequence, the set of TIs needs to be converted into a sequence of tokens.In Section "Linearization of sets of template instances", we present a simple recursive method for linearizing sets of TIs along a context free grammar (CFG) for describing the linearized structures.In Section "Decoding" we adopt the presented CFG for generating valid token sequences representing sets of TIs.

Transformer-based encoder-decoder models
Transformer-based [60] encoder-decoder models are seq2seq models which haven been used on a variety of natural language processing tasks like machine translation [61] and text summarization [62].The encoder part of the Transformer learns a contextualized representation of the input tokens w 1 , . . ., w n via multi-headed self-attention [60], converting the input sequence into a sequence of vectors h 1 , . . ., h n ∈ R d , where d is the dimension of the Transformer model.Then the decoder part takes the vector sequence from the encoder as input and produces an output vector sequence The computational complexity of self-attention grows quadratically with the number of tokens.Beltagy et al. [16] proposed the Longformer encoder-decoder, which combines local and global multi-headed self-attention in the encoder, reducing computational complexity from O(n 2 ) to O(n).
The output vector sequence d is used to compute a probability distribution over the vocabulary of the underlying model via the following equation: where v i ∈ R d is the embedding of token y i , b i is a bias for token y i , d t−1 is the output vector of the decoder at posi- tion (t − 1) and d is the model dimension.The probability of token y t at position t is conditioned on the input token sequence x and the past decoded tokens y 1 , . . ., y t−1 .This dependence is encoded through the vector d t−1 via multi-headed self-and cross-attention.
Token prediction in the decoder is done by maximum a posteriori probability (MAP) inference.Hence the predicted token at position i is given by the token with maximal posterior probability: The generative model is trained via teacher forcing by minimizing the cross entropy loss between the predicted token distribution described by Eq. ( 9) and the ground truth label.

Linearization of sets of template instances
As encoder-decoder models expect the output space to be token sequences, we present a simple recursive linearization procedure of template instances (TIs).First, note that TIs are described by the content of their slots (i.e., their slot-fillers), and that slot-fillers can be either text spans from the input document or other TIs.Hence the recursion base is given by the linearization of textual slot-fillers.Let f = w k 1 , . . ., w k m be a (9) p(y i |x, y 1 , . . ., token sequence which represents a textual slot-filler f for a slot of name SLOT.Then the linearization of this slot-filler is the token sequence itself enclosed by the special tokens [start:SLOT] and [end:SLOT], i.e.
where ⊙ denotes the concatenation of tokens.If the slot-filler is a TI, then it is recursively linearized and the resulting token sequence is enclosed by the special tokens [start:SLOT] and [end:SLOT].The linearization of TIs is described below.
In general, more than one slot-filler can be assigned to a slot of a TI.Therefore, we denote the complete content of a slot as a set F of slot-fillers.As sets, in contrast to sequences, are unordered constructs by definition, the linearization of sets of slot-fillers is inherently ambiguous.To get an unambiguous order, we introduce a slot ordering operator ω which converts sets of slot-fillers into sequences of slot-fillers according to predefined criteria (e.g.position within input document in case of textual slot-fillers).Then sets F of slot-fillers are linearized as follows: First, we sort the elements of F according to the sorting operator ω and obtain a sequence F of slot- fillers.Then we linearize each slot-filler in F as described above and concatenate the resulting token sequences, respecting the ordering of slot-fillers in F.
Next, we describe the linearization of TIs.As TIs are represented by the content of their slots, the linearization of a TI has to include the linearization of its slots.However, a template does not impose any ordering of its slots, and hence the linearization order of the slots of a TI is undefined.Therefore, we introduce another ordering operator which orders the slots of a template.Then the linearization of a TI is the concatenation of the linearizations of its slots according to the ordering of its slots given by the ordering operator .
Any set of TIs induces a graph with TIs as nodes and links between TIs as edges.Recall that there is a link from TI t ij to TI t kl iff t kl is a slot-filler of t ij .In order to guaran- tee that the linearization algorithm described above is well defined, we require the induced graph to be 1) acyclic and 2) connected.The first requirement ensures that the linearization algorithm terminates, while the second ensures the absence of isolated TI, which can not be linearized.
However, choosing ω and is only necessary for train- ing but not for inference purposes, as the decoding allows to fill template slots in any order.Therefore, we choose arbitrary but fixed ω and for the experiments described in the "Experimental results" Section.
A full example for a whole linearized publication template instance can be found in Listing 2 in Appendix 6.A shorter example for an intervention template instance with both textual and template slot fillers can be found in Fig. 3.

A context-free grammar for describing linearization of sets of template instances
In the following, we describe the linearization of sets of TIs (described in Section "Linearization of sets of template instances") by a context-free grammar (CFG) which is used in the decoding process ("Decoding" Section) to constrain the generation of tokens.A CFG is defined by a 4-tuple G = (N , T , R, S) , where N is a set of non-terminal symbols, T is a set of terminal symbols, R is a set of production rules and S ∈ T is the start symbol of the grammar.The set of terminal symbols is defined by the vocabulary of the underlying encoder-decoder model together with some special tokens for defining the production rules R. The recursion base of the linearizations of sets of TIs is given by the linerization of textual slots which we describe by the following equation: where TEMPLATE and SLOT are placeholders for names of template and slots, respectively, TEXT is a placeholder for any token sequence from the input document and [start:SLOT], [end:SLOT] are special tokens enclosing the textual slot-filler.Eq. ( 11) schematically defines production rules for textual slot-fillers, and TEMPLATE is the non-terminal symbol which is used to identify the respective production rules.Note that the non-terminal symbol TEMPLATE on the right-hand side of Eq. ( 11) allows recursion and hence the application of more the one production rule associated with the Analogously to the production rules defined by Eq. (11) for textual slot-fillers, the production rules for slots containing TIs as slot-fillers is defined by where TEMPLATE_HEAD is a placeholder for any nonterminal symbol whose associated production rules are derived from Eq. ( 13).Listing 1 shows the production rules for the data model used in our experiments.

Decoding
In Section "A context-free grammar for describing linearization of sets of template instances", we presented a CFG which describes valid token sequences representing a set of TIs.In this section, we describe a simple method to constrain token prediction such that only such token sequences are generated which are valid according the CFG.For example, consider a slot Drug which can have textual slot-fillers for describing drug names for a medication.After the special token [start:Drug] has been predicted, we know that the set of next possible tokens would consist of all tokens from the input document plus the special token [end:Drug].This information is encoded by the CFG, and the decoding method described in this section uses this information to constrain token prediction.
In this paper, we slightly generalize the constrained decoding approach of Lu et al. [20] to arbitrary rightlinear CFGs by applying a strategy similar to recursive descent parsing.
Beginning with a start symbol, in our case PUBLICA-TION_HEAD, the set of possible next tokens is calculated in each decoding step.This set is then used to generate a mask for the model vocabulary to discard all tokens which would not comply with the production rules of the CFG.From the remaining tokens, we select the token with the maximum value in a greedy fashion.The implementation of a beam search to optimize the decoding output even more remains for future work.
To keep track of the decisions and possible next tokens, a stack data structure is used to guide the decoding.Whenever a start token of a slot like [start:NumberAffected] is chosen as the decoded token, this decision is saved by adding this to the decoding stack.This is then used to constrain the tokens in the next step to be only those which can follow a [start:NumberAffected] token.Similarly, when an end token like [end:NumberAffected] is chosen, the top stack element is removed from the stack.This way, the decoding is guided to comply with the requirements imposed by the CFG and this way ensuring the output can then be parsed into actual TIs.

Experimental results
In this section, we discuss the setting of our experiments as well as the results of those experiments.

Experimental setting
In our experiments, we use the same dataset as Witte and Cimiano [14] for type 2 diabetes and glaucoma.The dataset comprises a total of 211 documents for two diseases: type 2 diabetes (104) and glaucoma (107).The 104 type 2 diabetes documents are split up into a training, validation and test sets of size 68, 16 and 20, respectively.Analogously, the 107 glaucoma documents are split up into a training, validation and test sets of size 69, 17 and 21, respectively.We use the same fixed train-validation-test split and run separate experiments for those two diseases.Both the extractive and the generative approach were then evaluated using multiple base models, namely allenai/longformer-base-4096 [16] for the extractive approach and allenai/ led-base-16384 [16] as well as google/flan-t5-base [17] for both approaches.As the extractive approach requires just an encoder whereas the generative approach needs a decoder due to its seq2seq nature, we compare two encoder-decoder models from which only the encoder is used in the extractive approach.Additionally, we also evaluate an encoder-only model for the extractive approach to ensure the partial usage of the encoder-decoder models does not harm the performance.
For these models and diseases, we then run hyperparameter optimizations using Optuna [63] with 30 trials each and measuring performance using validation F 1 scores.In each trial, an initial learning rate (between 1e −3 and 1e −5 , using logarithmic domain) and a for the lambda learning rate scheduler (between 0.9 and 1.0, using logarithmic domain, learning rate calculated with lr(epoch) = epoch ) are sampled from Optuna.The used batch size is 1 and the number of epochs is 50 in all experiments.Each experiment is then run on a single NVIDIA A40 GPU.The best hyperparameters for each disease-approach-model-combination are then used to train 10 additional models.Unless stated differently, mean and standard deviation in tables refer to the different results of these 10 training runs.The means and standard deviations of the test F 1 scores of these 10 trained models are listed in Table 4 for each combination.

Slot-filler extraction results
In all categories, the extractive approach paired with the flan-t5-base model performs best.In summary, for glaucoma, the extractive approach performs best with model flan-t5-base and a mean test F 1 score of 0.636 ( ±0.006 standard deviation across the 10 training runs with the best found hyperparameters of the category).This way, it outperforms the other tested models of the extractive approach as well as all models of the generative approach by 0.02 or more.For type 2 diabetes, the extractive approach performs best as well with model flan-t5-base and a mean F 1 score of 0.547 ( ±0.006 standard deviation).This indicates that the extractive approach is superior to the generative approach, although the lead is much smaller for type 2 diabetes than for glaucoma.
Table 5 shows the mean F 1 scores per template on the type 2 diabetes and glaucoma test set.The table shows the values of the best models of each category (w.r.t.validation F 1 score), i.e. the flan-t5-base models in all four cases.The mean F 1 values are calculated for each of the 10 models trained using the best hyperparameters of their respective category.The values in the table correspond to the mean and standard deviation of those mean F 1 scores per template.The generative approach performs better than the extractive one on the Medication templates (0.48 vs. 0.34 and 0.62 vs. 0.53 F 1 score for type 2 diabetes and glaucoma, respec- tively).On the Population and Outcome template, the results are mixed with one approach performing better for one disease dataset but not for the other.On all six remaining templates, the extractive approach performs better, although with different margins.
Mean F 1 scores per slot are shown in Table 7 in the Appendix 2, again with mean and standard deviation (of the mean F 1 scores) calculated for the 10 models trained using the best hyperparameters of their respective category.The F 1 scores of the different slots range from over 0.9, e.g.PMID or PublicationYear, to below 0.1, e.g.FinalNumPatientsArm or ObservedResult.There are also some noticeable differences between the diseases, with Journal achieving scores of 0.96 and 0.92 for type 2 diabetes in contrast to 0.67 and 0.74 for glaucoma.There are also slots where one approach performs better than the other across both datasets, e.g.DoseUnit (0.77/0.8 generative vs. 0.24/0.6extractive) and NumberPatientsCT (0.65/0.65 generative vs. 0.93/0.86extractive).

Joint training on both datasets
Additionally to the main experiment described above, we ran another small experiment, training the bestperforming generative and extractive model (flan-t5-base in both cases) with the best-performing respective parameters in 10 trials on the union of the type 2 diabetes and glaucoma training, validation and test datasets, respectively.The resulting models are then again evaluated on the separated datasets for comparability reasons.The resulting mean F 1 scores ( ±σ ) for the generative approach are 0.556 (± 0.026) for type 2 diabetes and 0.626 (± 0.015) for glaucoma.For the extractive  approach, the mean F 1 scores ( ±σ ) are 0.560 (± 0.007) for type 2 diabetes and 0.644 (± 0.008) for glaucoma.Therefore, the performance increases for both datasets and both approaches compared to the original results trained on the separated datasets.Moreover, the generative approach achieves comparable performance to the scores of the extractive approach trained on the separated datasets.At the same time, the extractive approach gets even better when also trained on both datasets at the same time.
Considering the relatively small datasets, this might indicate that performance for both diseases benefits from similar data in the other dataset, respectively.Therefore, we are optimistic that the training of a single general model (in contrast to specialized models for each disease as described in the main experiment) is possible with comparable or even better performance on diseases the model has been trained on (i.e., in-distribution data) and acceptable performance on different but similar diseases (i.e., out-of-distribution data).However, another dataset would be necessary to test this hypothesis such that this remains to be investigated in future work.

Inferred template cardinality results
In this section, we evaluate the ability of our models to infer the correct number of instances for each template type.For this, we compare the number of inferred templates to the number of instances in the gold standard by computing the mean abolsute deviation.Table 6 shows the mean absolute deviation between the ground truth and predicted template cardinality of the best extractive and generative model on the type 2 diabetes and glaucoma test sets.The mean absolute deviation values are calculated separately for each of the 10 models trained using the best hyperparameters of their respective category.The values in the table are then mean and standard deviation of those mean absolute deviations across the respective 10 trained models.Additionally, in Appendix 5, the corresponding mean ground truth (GT) and predicted template cardinalities are listed in order to allow a judgement whether or not a certain deviation is high.Note that the templates Publication, ClinicalTrial and Population are not mentioned in these tables as their cardinality is always one.
On the type 2 diabetes dataset, the extractive approach yields better results than the generative approach in terms of template cardinality prediction for the DiffBetweenGroups, Endpoint and Medication templates, whereas the generative approach yields better results for the Arm, Intervention and Outcome templates.On the glaucoma dataset, the generative approach performs better than the extractive one in terms of cardinality inference on all templates except DiffBetweenGroups (0.39 vs. 0.17) and Endpoint (2.91 vs. 0.35).

Discussion
The overall slot-filler extraction results of both models in terms of micro F 1 measure indicate that the extractive approach is slightly superior to the generative approach, although the margin is especially small for the type 2 diabetes dataset (cf.Table 4).Moreover, the mean F 1 scores per template (Table 5) suggest that the extractive approach performs better than the generative one on most templates on both datasets.
However, the full picture is a little more complex and both approaches have areas in which they perform better or worse than the other one and vice versa, and that for a variety of reasons.
First, it is noticeable that the F 1 scores for glaucoma are, on average, higher than those for type 2 diabetes.Nevertheless, the difference between the results for both datasets is not the same for both approaches, although the trend is the same.For the generative approach, the performance of the best-performing flan-t5-base model decreases by just 0.045 (around 7.7% relatively) and the led-base-16384 version even increases its mean performance.
In contrast, the best-performing extractive version, again flan-t5-base, loses 0.089 (around 14% relatively) in terms of F 1 performance -relatively almost twice as much as the generative approach.This may indicate that the extractive approach is better able to exploit certain characteristics which are specific to the glaucoma dataset and which are not present in the type 2 diabetes dataset, whereas the generative approach is more robust against those differences -both in a positive and in a negative way -and that way maybe generalizing a little more due to the more complex nature of the seq2seq task.However, it is not clear which properties of the data cause this deviation.
Considering robustness and the different complexity of the tasks of the extractive and generative task, this is to some degree also mirrored by the standard deviations of the two approaches.While the standard deviation for the extractive approach is not greater than 0.01, the standard deviation of the generative models is not smaller than 0.025 and gets up to 0.106 for led-base-16384.Therefore, it is more than doubled at least compared to the extractive approach.
Moreover, the standard deviation appears to be correlated to the chosen model, with flan-t5-base giving the lowest deviation, followed by (for the extractive part) longformer-base-4096 and finally ledbase-16384 consistently across both datasets.
The different strengths and weaknesses of both approaches become even more apparent examining the different performances separated by templates (Table 5) and, ultimately, single slots (Table 7 in the Appendix 2).
For whole templates, Table 5 shows an in parts mixed picture of which approach performs best.In many cases in which the extractive approach performs best, both approaches perform similarly well (e.g., Publication).However, there are also different cases like Clinical Trial where the margin is larger, but also Medication where the generative approach outperforms the extractive approach by around 0.1 although the standard deviation is also quite high for the generative glaucoma case.In other cases there are large differences between the two datasets, which is also true for the evaluation per slot.
As an example for unexpected single slot differences, consider the Journal slot.One would expect the recognition of the Journal slot to be a comparably simple task across both datasets.However, the performance greatly differs between the datasets, although both approaches achieve good scores on this slot.For the type 2 diabetes dataset, the performance is nearly perfect with scores above 0.9.In contrast, the scores for the glaucoma dataset are still good but much worse with scores around 0.7.The different possible slot fillers are shown in Table 9 in the Appendix 4. Looking at the different slot fillers, it is not immediately clear why the diabetes case is so much easier for both approaches than the glaucoma case.Both tables have approximately the same number of different entries and in both cases the journal names are in many cases trivial to recognize (containing either Diabetes or Ophthalmol).
However, the distribution of occurrences might partially explain the performance differences here.Although both datasets have similar number of Journal slot fillers with up to three occurrences, only the type 2 diabetes dataset has (even multiple) Journal slot fillers with a high number of occurrences (more than ≈ 8 , e.g.).There- fore, the reason why the Journal slot appears to be so much easier to recognize in the type 2 diabetes dataset might not be due to the textual form of the slot fillers but instead because fewer slot fillers account for a larger majority of the general slot occurrences compared to the glaucoma dataset.The absolute numbers and differences are still quite small, however, but this might allow to get much better scores just by recognizing two or three Journal slot fillers.There may be many more examples which are not discussed here.
All in all it is not clear in all cases what properties of the data cause those partial differences in performance.However, it underlines on the one hand how much data variance can influence information extraction approaches like the two presented ones.On the other hand, this also emphasizes how both approaches can have different strengths and weaknesses and a flat evaluation only considering the final single performance score does not do justice to the complex nature of the task.

Case study
Similarly to the work by Witte and Cimiano [14], we conduct a case study on a single RCT abstract in which we compare the predicted and ground truth results for one exemplary document out of the type 2 diabetes test dataset.For this case study, we use the same publication as considered by Witte and Cimiano [14] which is the one by Shankar et al. [64].The results of this case study can be found in Table 8 in the Appendix 3.
Both the extractive and the generative approach succeed in extracting the basic characteristics of the trial which are part of the Publication template, e.g.authors, title and publication year.This is consistent with the results of Table 5, which indicate that Publication is an especially easy template to extract.Similarly, the ClinicalTrial and Medication instances are, except some small errors, extracted almost perfectly.The template instance for the used Intervention is also extracted without errors by both approaches, which is a little more surprising taking into account the slightly lower score of around 0.6.Moreover, both approaches correctly predict that there are no textual slot fillers of the Arm template in the text.
For the Population template instance, we first encounter moderate differences to the gold standard.Although both approaches manage to extract USA as slot fillers for the Country slot, both fail to extract the second slot filler Australia as well as Ethnicity.
The latter is at least in line with the fact that the first gold standard precondition -mentioning the ethnicity of the patients -is not recognized by both approaches.For the second Precondition slot filler, both approaches get a part of it but not the full slot filler, with the generative approach recognizing a slightly larger part of the actual slot filler.This is to some degree unexpected, as the mean performance of the extractive approach on the Population templates of the type 2 diabetes dataset is more than twice as high as the score of the generative approach.
For the DiffBetweenGroups template, the extractive approach returns a perfect result in this case, whereas the generative approach misses the P < 0.001 slot filler but delivers a duplicate of the P = 0.013 slot filler.The mean results of Table 5 suggest similar performance, which is not the case here.
For the Endpoint template instances, both approaches manage to extract most slot fillers at least partially but show issues grouping them together correctly.The extractive approach puts all of the extracted slot fillers in just two instances, missing most instances of the gold standard.For the generative approach, however, it is the other way around and too many instances (containing some duplicates) are generated.Nevertheless, some of the generated instances are correct and in some cases there is just a part missing.Generally, the performance is rather unsatisfying here but is consistent with the comparably poor mean performance of around 0.4 on the Endpoint template, indicating this is an especially hard template to extract.
However, the situation is even worse for the Outcome template instances, which was to be expected considering the mean performance on the type 2 diabetes dataset of just 0.2 and 0.11 for the generative and extractive approach, respectively.Again, both approaches at least partially recognize most slot fillers, but fail to group them together correctly.Similarly to the Endpoint template instances, the extractive approach generates too few instances whereas the generative approach generates more instances.Nevertheless, those instances are not entirely correct in most cases.This suggests future work has to improve this grouping beyond simple similarity calculations or fully relying on the language model and constrained decoding.
Taken together, the current results, while promising, are not accurate enough to support the full automatic creation of a systematic review as proposed by Sanchez-Graillet et al. [10].However, the proposed approach could considerably reduce the workload for teams to extract key information from a set of publications in the sense proposed by Thomas et al. [65].The results, however, would need to be manually controlled.While the approach is not yet suited to support the full creation of a systematic review at high-quality, it could be used to summarize the existing literature in a costeffective fashion to allow researchers to get a first overview of existing clinical evidence or as a basis to form hypothesis to be validated further on.

Conclusion
We have presented an extended extractive and a generative approach for extracting structured information from Randomized Controlled Trial abstracts, which can both support clinicians in finding best therapies on the basis of clinical evidence and in creating systematic reviews of the whole body of available clinical evidence.The extractive approach is realized by a two-step architecture which first extracts slot-fillers from the input document, followed by a clustering step which assigns the extracted slot-fillers to template instances.The best models of this approach yield an average F 1 score of 0.547 on type 2 diabetes and 0.636 on glaucoma test sets, respectively.In the generative approach, the structured information given by the template instances is encoded as a linear token sequence which is decoded at inference time by utilizing a context-free grammar for guidance.The best models of the generative approach yield an average F 1 score of 0.539 on type 2 diabetes and 0.584 on glaucoma test sets, respectively.
Future work should investigate whether the lead of the extractive approach persists when the base models of both approaches are scaled up, e.g. by using flan-t5-large, flan-t5-xl or even flan-t5-xxl or other large language models.The benefits of the extractive and generative approach could also be combined by adding a pointer network to the generative model.We will also investigate whether integrating a pointer network into the generative model can improve results.It would be also interesting to test the results in an actual evidence generation and comparison case study to assess whether the approach can indeed support the process of summarizing results from the clinical literature for a particular research question.

Fig. 1
Fig.1Illustration of both described approaches starting with the tokenized input and ending with the generated template instances

Fig. 2 Table 2
Fig. 2 Schema of the PICO data model used in the experiments

( 11 )
TEMPLATE := [start:SLOT] TEXT [end:SLOT] TEMPLATE non-terminal symbol TEMPLATE.The recursion base of production rules is given by where TEMPLATE is again a placeholder for the template name and [end:TEMPLATE] is a special token indicating the end of the linearization of the template TEM-PLATE.Production rules for TIs are described by

Table 1
Mean and standard deviation of the number of slot fillers per template in the used dataset, separated by type of disease.Numbers rounded to two decimal places

Table 3
Example similarities/compatibilities between four slot fillers, slot types in first row have been omitted

Table 4
Mean and standard deviation σ of test F 1 scores across 10 models trained using best-performing ( F 1 on validation dataset) configuration found in 30 trials of hyperparameter optimization.Numbers rounded to three decimal places, best configuration of each disease marked bold

Table 5
Mean slot F 1 values per template.Each cell shows mean and standard deviation of 10 training runs with the best found hyperparameters for best (w.r.t.validation F 1 score) configurations of each category.Numbers rounded to two decimal places, best values marked bold Type 2 diabetes F 1 ( ±σ)Glaucoma F 1 ( ±σ)

Table 6
Mean absolute deviation between ground truth and predicted template cardinality.Each cell shows mean and standard deviation of 10 training runs with the best found hyperparameters for best (w.r.t.validation F 1 score) configurations of each category.Numbers rounded to two decimal places, best values marked bold

Table 7
Mean test F 1 scores of the best models of each category per slot (mean and standard deviation of 10 training runs)

Table 8
Case study for disease Type 2 Diabetes.Multiple entries for same slot in same template instance separated by |

Table 10
Cardinality Evaluation Type 2 Diabetes Generative

Table 11
Cardinality Evaluation Type 2 Diabetes Extractive

Table 12
Cardinality Evaluation Glaucoma Generative

Table 13
Cardinality Evaluation Glaucoma Extractive